OpenCL: add initial FA support #14987


Merged (6 commits) on Aug 16, 2025
Conversation

rmatif
Collaborator

@rmatif rmatif commented Jul 31, 2025

This PR introduces F16/F32 FA (flash attention) support for the OpenCL backend. It has been extremely challenging to achieve good performance on this kind of hardware, but I believe it is now decent enough to serve as a baseline that we can further iterate on. I also believe there is room for improvement for tg (token generation).
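For readers unfamiliar with FA, its core is the online-softmax recurrence, which computes attention over KV tiles without materializing the full score matrix. The NumPy sketch below illustrates that recurrence only; it is not the OpenCL kernel from this PR, and the tile size is arbitrary.

```python
import numpy as np

def flash_attention(Q, K, V, tile=32):
    """Single-head attention computed tile-by-tile with the online-softmax
    recurrence. Reference sketch only, not the PR's OpenCL kernel."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V)             # running (unnormalized) output
    m = np.full(n, -np.inf)          # running row-wise score maximum
    l = np.zeros(n)                  # running softmax denominator
    for j in range(0, n, tile):
        S = (Q @ K[j:j + tile].T) * scale        # scores for this KV tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                # rescale previous partials
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]

def naive_attention(Q, K, V):
    """Full-matrix reference for comparison."""
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V
```

The tiled and naive versions agree numerically; the win on GPUs is that only one `tile`-wide slice of scores ever needs to live in on-chip memory at a time.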

Results on Adreno 830:

| model | size | params | backend | ngl | fa | test | t/s |
| ------------ | -------- | ------ | ------- | --- | -- | ----- | ------------- |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 0  | pp512 | 198.69 ± 0.59 |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 0  | tg128 | 21.88 ± 0.85  |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 1  | pp512 | 274.75 ± 1.22 |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 1  | tg128 | 21.58 ± 0.39  |

Adreno 750:

| model | size | params | backend | ngl | fa | test | t/s |
| ------------ | -------- | ------ | ------- | --- | -- | ----- | ------------- |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 0  | pp512 | 139.96 ± 0.51 |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 0  | tg128 | 19.70 ± 0.11  |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 1  | pp512 | 151.22 ± 0.85 |
| llama 1B F16 | 2.30 GiB | 1.24 B | OpenCL  | 99  | 1  | tg128 | 17.94 ± 0.15  |

@rmatif rmatif requested review from max-krasnyansky and lhez July 31, 2025 12:35
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning OpenCL Issues specific to the OpenCL backend labels Jul 31, 2025
@lhez
Collaborator

lhez commented Aug 1, 2025

@rmatif Very cool, thank you!

@lhez
Collaborator

lhez commented Aug 10, 2025

Sorry, got distracted during the past week. Will come back to this asap.

@lhez
Collaborator

lhez commented Aug 15, 2025

It seems to help small models like qwen2.5-0.5b:

qwen2.5-0.5b-Q4_0

A750

ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16

| model | size | params | backend | ngl | fa | test | t/s |
| ------------- | ---------- | -------- | ------- | --- | -- | ------ | -------------- |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 1  | pp1024 | 634.68 ± 2.06  |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 1  | tg128  | 35.91 ± 4.13   |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 0  | pp1024 | 283.68 ± 0.56  |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 0  | tg128  | 34.57 ± 2.91   |

A830

ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0800.35 Compiler E031.47.18.28

| model | size | params | backend | ngl | fa | test | t/s |
| ------------- | ---------- | -------- | ------- | --- | -- | ------ | --------------- |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 1  | pp1024 | 1079.08 ± 1.04  |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 1  | tg128  | 97.31 ± 0.49    |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 0  | pp1024 | 388.14 ± 15.41  |
| qwen2 1B Q4_0 | 265.25 MiB | 494.03 M | OpenCL  | 99  | 0  | tg128  | 95.84 ± 0.94    |

@lhez
Collaborator

lhez commented Aug 16, 2025

Current implementation works well for small models (e.g., qwen2.5-0.5B), significantly improving pp performance. For larger models, larger configs (e.g., {128, 128, 32, 32} for qwen2.5-1.5B) are used; these configs seem to result in spilling into global memory.

We will use this implementation as the baseline and do further investigations and improvements.
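For intuition about why the larger configs might spill: a rough per-workgroup on-chip footprint for an F16 FA tile grows with the product of the tile dimensions. Treating the first two config values as row/column tile sizes and assuming a head dimension of 128 are illustrative assumptions here, not facts taken from the kernel.

```python
def fa_tile_bytes(br: int, bc: int, head_dim: int, elem_size: int = 2) -> int:
    """Rough on-chip bytes for one F16 FA tile: a Q tile (br x d), K and V
    tiles (bc x d each), plus the score tile (br x bc). Illustration only;
    the real kernel's data layout may differ."""
    return (br * head_dim + 2 * bc * head_dim + br * bc) * elem_size

# a {32, 32}-style tile vs a {128, 128}-style tile, head_dim = 128 (assumed)
print(fa_tile_bytes(32, 32, 128))    # 26624 bytes, ~26 KiB
print(fa_tile_bytes(128, 128, 128))  # 131072 bytes, 128 KiB
```

Against the few tens of KiB of local memory typical for mobile GPUs, the larger configuration clearly cannot stay on chip, which would be consistent with the observed spilling into global memory.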

@lhez lhez merged commit 912ff8c into ggml-org:master Aug 16, 2025
46 of 47 checks passed